Abstract

The Millennium Villages Project (MVP) is a ten-year integrated rural development project 
implemented in ten sub-Saharan African sites. At its conclusion we will conduct an evaluation 
of its causal effect on a variety of development outcomes, measured via household surveys 
in treatment and comparison areas. Outcomes are measured by six survey modules, with 
sample sizes for each demographic group determined by budget, logistics, and the group’s 
vulnerability. We design a sampling plan that aims to reduce effort for survey enumerators 
and maximize precision for all outcomes. We propose two-stage sampling designs, sampling 
households at the first stage, followed by a second stage sample that differs across demographic 
groups. Two-stage designs are usually constructed by simple random sampling (SRS) of 
households and proportional within-household sampling, or probability proportional to size 
sampling (PPS) of households with fixed sampling within each. No measure of household size 
is proportional for all demographic groups, putting PPS schemes at a disadvantage. The SRS 
schemes have the disadvantage that multiple individuals sampled per household decreases 
efficiency due to intra-household correlation. We conduct a simulation study (using both 
design- and model-based survey inference) to understand these tradeoffs and recommend a 
sampling plan for the Millennium Villages Project. Similar design issues arise in other studies 
with surveys that target different demographic groups. 


1 Background 


The Millennium Villages Project (MVP) is an economic development project that targets rural
populations across ten countries in sub-Saharan Africa, implementing a multi-sector package of
interventions at a village level (Sachs and McArthur, 2005; Sanchez et al., 2007). See Mitchell
et al. (2015a) for background on the project, study site selection, outcomes of interest, and a
comprehensive description of the plan to evaluate its effectiveness. Mitchell et al. (2015b) describe
our plan for causal inference about the MVP’s effect on a variety of development outcomes measured
in different demographic groups. These outcomes will be measured via survey modules administered
in both treatment and comparison villages.


A design analysis described in Mitchell et al. (2015b) was used to recommend the number
of control villages and magnitude of sampling in each. Next, we must determine how to select
households and individuals within households. We propose a two-stage sample: households will
be sampled in stage I, followed by individuals within households in stage II (Lohr, 2010; Särndal
et al., 1992). In the first stage, we must decide between simple random sampling and probability
proportional to size sampling of households. Because the project operates at the village level, a
sampling plan that efficiently estimates outcome means per village is an efficient sampling plan for
the overall causal evaluation. In this paper we conduct a simulation study to decide on a sampling
plan for estimating finite population village means.

We aim to minimize the design effect, the ratio between the actual and effective sample sizes. 
One factor in determining the efficiency of a sampling design is the intraclass correlation, i.e. the 
correlation among individuals within a household. If more than one individual is sampled per 
household, the intraclass correlation increases the design effect, reducing the effective sample size 
relative to the actual sample size. 
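To make the role of the intraclass correlation concrete, the sketch below uses the standard Kish approximation, deff = 1 + (m - 1) * rho, for a design that takes the same number m of people from every sampled household. This is a textbook approximation offered for illustration only; it is not the computation used in our simulations, and the function names are hypothetical.

```python
def design_effect(sampled_per_household: float, icc: float) -> float:
    """Kish approximation to the design effect for equal within-household takes.

    sampled_per_household: number of people surveyed in each sampled household (m >= 1).
    icc: intraclass (intra-household) correlation, rho.
    """
    if sampled_per_household < 1:
        raise ValueError("at least one person must be sampled per household")
    return 1.0 + (sampled_per_household - 1.0) * icc


def effective_sample_size(n_actual: float, deff: float) -> float:
    """Effective sample size: actual sample size divided by the design effect."""
    return n_actual / deff


if __name__ == "__main__":
    # Illustrative values only: 400 people, 2 per household, rho = 0.25.
    deff = design_effect(2, 0.25)
    print(deff, effective_sample_size(400, deff))
```

With one person per household the approximation gives a design effect of one regardless of the correlation, which is why the number sampled per household matters for the comparisons below.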

Another factor in the efficiency of a sampling design is the distribution of individuals’ sampling
probabilities. Sampling probabilities can be optimized for a specific outcome, e.g. by sampling
with probability approximately proportional to the outcome (Särndal et al., 1992, p.88). However,
with many outcomes of interest, such tailored optimization is difficult or impossible. Therefore, a
self-weighted sample design is preferred, such that all individuals are sampled with equal probability
(Kish, 19??; Lohr, 2010, p.287). Such samples are representative without weighting adjustments,
and unbiased point estimates can be obtained from standard statistical procedures.

Given a fixed precision, we aim to minimize time and resources for the survey enumerator teams.
This includes minimizing the numbers of people surveyed (i.e. the actual sample size), but also
considering the number of households visited, and the effort required to prepare a sampling frame.
To conduct the first stage of sampling, a scheme that samples households with equal probability only
requires a list of all households with GPS coordinates identifying their locations. However, a scheme
which samples households with probability proportional to size requires some measure of household
size (e.g. the total number of household members). This additional piece of information requires
more effort for enumerators, especially for larger villages with many households. After either
method of first stage sampling, we will conduct a demographic census in the sampled households
to create the sampling frame for the second stage.

In this paper we conduct a simulation study to understand the tradeoffs between simple random 
sampling and probability proportional to size sampling of households in the context of the MVP 
evaluation. Additionally, our simulations explore design-based versus model-based inference, a 
dichotomy which has implications for the analysis of our outcome data.


2 Outcomes and survey modules 

The Millennium Villages Project (MVP) defines 51 outcomes of interest, including measures of
poverty alleviation, agriculture, education, gender equality, health, environmental sustainability,
and infrastructure (Mitchell et al., 2015a). These outcomes are measured in six different survey
modules, whose content is discussed in Section ?? of Mitchell et al. (2015a). These modules include:
• a household survey, administered to all household heads (or other knowledgeable household
members) within the sampled households;

• a sex-specific adult survey, administered to men and women of reproductive age (15 to 49
years) within the sampled households;

• within the adult-female survey, a birth history section, administered to women of reproductive
age (15 to 49 years) both in the sampled households and in additional sampled households
to reach sample size sufficient for estimating child mortality;

• a nutrition survey, administered to men and women age 15 to 49 years in sampled households;

• blood (malaria and anemia) testing, administered to four age-sex groups in sampled
households: children age 6 to 59 months, school-aged children (5 to 14 years old), men age
15 to 49 years, women age 15 to 49 years; and

• anthropometry measurements, administered among children age 6 to 59 months in sampled
households.

For each module and age-sex group combination, the project has budgeted a target sample size
based on a combination of budget, logistics, and relative importance of different vulnerable
populations and intervention beneficiaries.


3 Sampling plans considered 

For the purpose of our simulation study, we consider all survey modules except for birth history and
the nutrition survey. Our sampling will be performed in two phases. First, we will sample
households using either simple random sampling (SRS, without replacement) or probability proportional
to size sampling (PPS, with replacement), with household size defined as $N_{h,\mathrm{total}}$, the number of
household members under 50 years old in household $h$. Let $s_I$ be the set of (unique) sampled
households. In the PPS scheme, we use $t_I$ to denote the set of sampled households with repeats.
Let $n_I = |s_I|$ be the number of households sampled without replacement in the SRS scheme and
let $m_I = |t_I|$ be the number of households sampled with replacement in the PPS scheme. We let
$n_I = m_I = 300$ based on the project’s previous survey rounds and budget for the final survey
round.
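The two first-stage options can be sketched in a few lines. The following is a minimal illustration using only the Python standard library, not the code used for our simulations; the function names and the toy household sizes are hypothetical.

```python
import random


def srs_households(household_ids, n_households, rng):
    """Simple random sample of households, without replacement (no repeats)."""
    return rng.sample(list(household_ids), n_households)


def pps_households(household_sizes, n_draws, rng):
    """PPS sample of households, with replacement.

    household_sizes maps household id -> size measure (e.g. members under 50).
    Each of the n_draws draws selects household h with probability proportional
    to its size, so the returned list may contain repeats.
    """
    ids = list(household_sizes)
    weights = [household_sizes[h] for h in ids]
    return rng.choices(ids, weights=weights, k=n_draws)


if __name__ == "__main__":
    rng = random.Random(2015)
    sizes = {h: 1 + h % 7 for h in range(1000)}  # toy village, hypothetical sizes
    s = srs_households(sizes, 300, rng)
    t = pps_households(sizes, 300, rng)
    print(len(set(s)), len(set(t)))  # unique households under each scheme
```

Because the PPS draws are made with replacement, the number of unique households visited is typically somewhat smaller than the number of draws.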

To describe the within-household sampling plans for each survey module, we use the following
notation. Let $N_h$ be the total number of people in household $h$ that are in the target age-sex
group for a particular module. Let $n_h \le N_h$ be the number of people in household $h$ that we
sample and survey. For example, if considering the anthropometry module, then $N_h$ is the number
of children under five years of age in household $h$ and $n_h$ is the number of those sampled for the
anthropometry module. Let $N = \sum_h N_h$ be the total number of people in the sampling frame (an
MV1 or a control village) that are in the module’s target age-sex group.

We now outline the within-household sampling plans considered in our simulation study.


Adult, anthropometry, and blood modules

For each module and age-sex group combination, the project has budgeted a target sample size,
$n_{\mathrm{target}}$:

• adult survey - $n_{\mathrm{target}} = 400$ men and $n_{\mathrm{target}} = 400$ women of reproductive age (15 to 49
years);

• blood (malaria and anemia) - $n_{\mathrm{target}} = 300$ children age 6 to 59 months, $n_{\mathrm{target}} = 100$
school-aged children (5 to 14 years old), $n_{\mathrm{target}} = 100$ men age 15 to 49 years, $n_{\mathrm{target}} = 100$
women age 15 to 49 years; and

• anthropometry - $n_{\mathrm{target}} = 300$ children age 6 to 59 months.

The second stage sampling schemes we consider in this simulation study are, for a given module
and age-sex group:

• For SRS sampling of households - combine all $N_{s_I} = \sum_{h \in s_I} N_h$ people in the sampled
households in the target age-sex group. If $N_{s_I} \le n_{\mathrm{target}}$, then survey all. Otherwise, we
consider two options:

    - stratify by sampling $n_h$ individuals from each household, where $n_h$ is proportional (up
    to rounding) to $N_h$, and the constant of proportionality is determined by the total in
    the sampled households:

    $$n_h = \mathrm{round}\left(N_h \cdot \frac{n_{\mathrm{target}}}{N_{s_I}}\right)$$

    - take an equal-probability systematic sample of $n_{\mathrm{target}}$ people. We order the households
    randomly, and people (in the module’s target age-sex group) within households
    randomly, so that the people within a household are listed consecutively. We then take a
    sample using the fractional interval method described in Särndal et al. (1992, p.77) and
    Appendix A. This procedure enables us to control sample sizes and spread the sample
    across households such that the sample size in a household is always either the ceiling or
    the floor of the expected sample size in that household under simple random sampling
    (see Appendix A). Conceptually, this is similar to stratifying on household, except that
    there is dependence of the samples between strata (i.e. households).

• For PPS sampling of households - if $n_{\mathrm{target}} \ge m_I$, sample a fixed number of people, $n_h = 1$,
per household (regardless of household size) if available.¹ If $n_{\mathrm{target}} < m_I$, take a simple
random sample of $n_{\mathrm{target}}$ households from $t_I$ to obtain a smaller PPS sample of households.
Then sample $n_h = 1$ per household if available.
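As an illustration of the stratified option for SRS households, the sketch below computes the proportional allocation with rounding. It is a minimal sketch under our reading of the allocation rule, with a hypothetical function name; note that Python's built-in `round` rounds exact halves to the nearest even integer, which may differ from the rounding rule used elsewhere.

```python
def proportional_allocation(eligible_counts, n_target):
    """Allocate within-household sample sizes proportionally to household counts.

    eligible_counts maps sampled household id -> number of people in the
    target age-sex group. If the sampled households contain no more than
    n_target eligible people, everyone is surveyed.
    """
    total = sum(eligible_counts.values())
    if total <= n_target:
        return dict(eligible_counts)
    return {
        h: min(count, round(count * n_target / total))  # never exceed the household
        for h, count in eligible_counts.items()
    }


if __name__ == "__main__":
    print(proportional_allocation({"a": 4, "b": 2, "c": 2}, 4))
    # Rounding can miss the target: three one-person households, target of 2.
    print(sum(proportional_allocation({"a": 1, "b": 1, "c": 1}, 2).values()))
```

The second call shows why rounding gives only approximate control over the realized sample size, one motivation for also considering the systematic option.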

¹ It is possible that a household is sampled without any members of the target age-sex group. Therefore, if
Household survey

For both the SRS and PPS schemes, the household survey module is administered to the head of 
household in each sampled household. 

3.1 Simulated data 

For each survey module, we simulate one outcome measured by that module. For the household 
survey we use the total household consumption; for the adult male survey we use the number of 
days after illness began when the man first sought advice or treatment; for the adult female survey
we use the number of times a woman received antenatal care during her most recent pregnancy; 
for malaria and anemia testing we use hemoglobin blood concentration; for the anthropometry 
module we use the weight for age z-score. In generating simulated data, we make the simplifying 
assumption that all individuals in a target age-sex group have non-missing outcomes. For example, 
we generate antenatal care outcomes for all women of reproductive age. 

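To generate data, we use the multilevel model

$$
\begin{aligned}
y_i &\sim \mathrm{Normal}(\alpha_{h[i]}, \sigma_y^2) && \text{for individuals } i, \\
\alpha_h &\sim \mathrm{Normal}(\mu + \beta_1 N_{h,\mathrm{total}} + \beta_2 N_{h,\mathrm{total}}^2, \sigma_\alpha^2) && \text{for households } h.
\end{aligned}
\tag{1}
$$

For total household consumption we use a model analogous to model (1),²

$$
t_h \sim \mathrm{Normal}(\mu + \beta_1 N_{h,\mathrm{total}} + \beta_2 N_{h,\mathrm{total}}^2, N_{h,\mathrm{total}}\,\sigma_t^2) \quad \text{for households } h.
\tag{2}
$$

We also use a model for the log total consumption (which in our data is more Normally distributed
than the total consumption),

$$
\log(t_h) \sim \mathrm{Normal}(\mu + \beta_1 N_{h,\mathrm{total}} + \beta_2 N_{h,\mathrm{total}}^2, \sigma_t^2) \quad \text{for households } h.
\tag{3}
$$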

We use the demographic information from the census and the multilevel model with estimated
parameter values (from the survey data) to generate simulated populations. If, when these models
are fit to past survey data, the 50% posterior interval of $\beta_1$ or $\beta_2$ contains 0, we set the parameter
to 0 when simulating populations. This prevents us from using very noisy estimates of coefficients.
Within each simulated population, we randomly sample according to the sampling plans described
above, and estimate the finite population mean using either model-based or design-based inference.
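The population-generating step can be sketched as follows. This is a minimal illustration of simulating individual outcomes from a two-level Normal model, assuming (as we read the model above) a household-level mean that is quadratic in household size; the function name, argument names, and parameter values are hypothetical, and the scale parameters are passed as standard deviations.

```python
import random


def simulate_population(household_sizes, eligible_counts,
                        mu, beta1, beta2, sigma_alpha, sigma_y, rng):
    """Simulate outcomes for every eligible person in every household.

    household_sizes: household id -> total household size (the size covariate).
    eligible_counts: household id -> number of people in the target group.
    Returns household id -> list of simulated individual outcomes.
    """
    population = {}
    for h, size in household_sizes.items():
        household_mean = mu + beta1 * size + beta2 * size ** 2
        alpha_h = rng.gauss(household_mean, sigma_alpha)   # household effect
        population[h] = [rng.gauss(alpha_h, sigma_y)       # individual outcomes
                         for _ in range(eligible_counts[h])]
    return population


if __name__ == "__main__":
    rng = random.Random(1)
    sizes = {1: 5, 2: 3, 3: 8}        # hypothetical census data
    eligible = {1: 2, 2: 0, 3: 3}
    # Hypothetical parameter values; beta2 set to 0 as for a noisy coefficient.
    print(simulate_population(sizes, eligible, 11.0, -0.1, 0.0, 0.8, 1.2, rng))
```

Setting a coefficient to zero, as we do when its 50% posterior interval contains 0, simply drops that term from the household-level mean.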

$n_h = 1$, then the PPS scheme will result in a smaller sample size than the SRS scheme. Additionally, for the adult
module (where $n_{\mathrm{target}} = 400$), if $n_h = 1$ then the PPS scheme will at most sample only 300 adults, one per sampled
household.

² Model (2) can be motivated by assuming that model (1) holds for individual-level consumption (this would assume
that within a household consumption is identically distributed, not taking into account age-sex differences). This
model implies that $t_h \mid \alpha_h \sim \mathrm{Normal}(N_{h,\mathrm{total}}\,\alpha_h, N_{h,\mathrm{total}}\,\sigma_y^2)$ and that the marginal variance of $t_h$ is
$N_{h,\mathrm{total}}(N_{h,\mathrm{total}}\,\sigma_\alpha^2 + \sigma_y^2)$.
4 Bayesian model-based inference


To generalize from the data to the population, both design-based and model-based inference must
take into account how the data are collected. Let $y = (y_1, \ldots, y_N)$ denote data for the population
of interest, and $I = (I_1, \ldots, I_N)$ indicators of the observation of $y$, where $I_i = 1$ if $y_i$ is sampled³
and $I_i = 0$ if $y_i$ is missing. Let ‘obs’ $= \{i : I_i = 1\}$ and ‘mis’ $= \{i : I_i = 0\}$. Thus, the information
available is $(y_{\mathrm{obs}}, I)$ and the likelihood is
$p(y_{\mathrm{obs}}, I \mid x, \theta, \phi) = \int p(y \mid x, \theta)\, p(I \mid x, y, \phi)\, dy_{\mathrm{mis}}$, where $x$ are
observed covariates. Bayesian inference computes the posterior distribution $p(\theta, \phi \mid x, y_{\mathrm{obs}}, I)$
(superpopulation inference) and $p(y_{\mathrm{mis}} \mid x, y_{\mathrm{obs}}, I, \theta, \phi)$ (finite population inference). Under the
ignorability condition, these inferences can be simplified to $p(\theta \mid x, y_{\mathrm{obs}})$ and $p(y_{\mathrm{mis}} \mid x, y_{\mathrm{obs}}, \theta)$.
Ignorability is satisfied if both the missing at random and distinct parameters conditions are
satisfied (Gelman et al., 2014, p.202, 206-211). Missing at random requires that the missingness be
independent of the missing values conditional on observed variables and a parameter $\phi$:
$p(I \mid x, y, \phi) = p(I \mid x, y_{\mathrm{obs}}, \phi)$.
The distinct parameters condition requires that the parameters of the missingness mechanism ($\phi$)
be independent of the parameters of the data generating process ($\theta$), conditional on covariates:
$p(\phi \mid x, \theta) = p(\phi \mid x)$.

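We include design variables such that the data collection mechanism is ignorable with respect
to this model. For example, in our SRS-stratified sampling plan, the data collection mechanism is:

$$
p(I \mid x, y, \phi) = 1 \Big/ \sum_{\substack{s_I \subseteq \{1,\ldots,N_I\} \\ |s_I| = n_I}} \; \prod_{h \in s_I} \binom{N_h}{n_h},
\quad \text{where } n_h = \mathrm{round}\left(N_h \frac{n_{\mathrm{target}}}{N_{s_I}}\right),
$$

if $\exists\, s_I \subseteq \{1,\ldots,N_I\}$ s.t. $|s_I| = n_I$ and $\sum_{i : h[i] = h} I_i = \mathrm{round}\left(N_h \frac{n_{\mathrm{target}}}{N_{s_I}}\right)$ for all $h \in s_I$. Otherwise, the
probability of missingness pattern $I$ is zero.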

Thus, we include as design variables the household identifiers and the $N_h$ (e.g. the number
of women per household, if the survey module targets women). Similar computations show that
under the SRS-systematic sampling scheme these variables are also sufficient to satisfy missing
at random. For the PPS scheme, we also will need the measure of household size used to select
the households (e.g. the total number of household members under 50) (Gelman et al., 2014,
p.211). For simplicity, we fit the same ignorable model for both the SRS and PPS schemes. For
the anthropometry, blood, and adult survey modules we fit

$$
\begin{aligned}
y_i &\sim \mathrm{Normal}(\alpha_{h[i]}, \sigma_y^2) && \text{for individuals } i, \\
\alpha_h &\sim \mathrm{Normal}(\mu + \beta_1 N_{h,\mathrm{total}} + \beta_2 N_{h,\mathrm{total}}^2 + \beta_3 N_h, \sigma_\alpha^2) && \text{for households } h.
\end{aligned}
\tag{4}
$$

For the household survey, we fit models (2) and (3).

Our parameter of interest is the finite population mean $\bar{Y} = \frac{1}{N} \sum_{h=1}^{N_I} N_h \bar{y}_h$, where

$$
\bar{y}_h = \frac{n_h}{N_h}\, \bar{y}_{h,\mathrm{obs}} + \frac{N_h - n_h}{N_h}\, \bar{y}_{h,\mathrm{mis}}
$$

(p.205). We obtain posterior simulations of $\bar{Y}$ as
follows: if household $h$ is sampled, we use a simulation of $\alpha_h$ to generate $N_h - n_h$ simulated $y_i$’s.
If household $h$ is not sampled, we use simulations of $\mu$ and $\sigma_\alpha$ to simulate a new $\alpha_h$, then generate
$N_h$ simulated $y_i$’s.

³ We assume that all units that are sampled are observed.
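The imputation step for one posterior draw can be sketched as below. This is a simplified illustration rather than our fitted model: it assumes the household-level mean is a single value `mu` with no household-size covariates, takes the posterior draws of the household effects as given, and uses hypothetical names throughout.

```python
import random


def finite_population_mean_draw(households, alpha_draws, mu, sigma_alpha, sigma_y, rng):
    """One posterior simulation of the finite population mean.

    households: list of dicts with keys 'id', 'N' (eligible people in the
        household) and 'observed' (list of observed outcomes; empty if the
        household was not sampled).
    alpha_draws: household id -> posterior draw of alpha_h (sampled households only).
    mu, sigma_alpha, sigma_y: one posterior draw of the model parameters.
    """
    total, population_size = 0.0, 0
    for hh in households:
        n_missing = hh["N"] - len(hh["observed"])
        if hh["id"] in alpha_draws:
            alpha_h = alpha_draws[hh["id"]]        # sampled household
        else:
            alpha_h = rng.gauss(mu, sigma_alpha)   # new effect for an unsampled household
        imputed = [rng.gauss(alpha_h, sigma_y) for _ in range(n_missing)]
        total += sum(hh["observed"]) + sum(imputed)
        population_size += hh["N"]
    return total / population_size
```

Repeating this for each posterior draw of the parameters yields a set of simulations of the finite population mean, whose variance is the posterior variance used in the model-based comparisons.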




5 Frequentist design-based inference 


We use the survey package to compute design-based estimates and variances (Lumley, 2004).
Though we perform our SRS schemes without replacement, we compute all variances without
finite population corrections, using the Horvitz-Thompson (Hájek) ratio estimator and its
with-replacement variance (Lohr, 2010, p.247).

Our SRS schemes are two-phase rather than two-stage designs, since the sampling within a
household depends on which households were sampled in the first stage (Särndal et al., 1992,
p.134-135). This dependence is reflected in the design weights we compute, see below. For the
SRS-systematic sampling scheme, the independence assumption of two-stage sampling is also violated,
with the sampling in each household dependent on the sampling in other households. Our
design-based analysis approximates these two-phase designs with a two-stage analysis. In contrast, in
model-based inference the details of the design do not matter in the analysis once we include
design variables in our model (Gelman et al., 2014, p.202, 206-211).


5.1 Design weights 

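For the SRS-systematic design, the inclusion probabilities are:

$$
\begin{aligned}
\pi_{hi} &= P[\text{person } i \text{ in household } h \text{ is sampled}] \\
&= P(h \in s_I)\, P(i \in s_h \mid h \in s_I) \\
&= \frac{n_I}{N_I} \sum_{s_I : h \in s_I} P(i \in s_h \mid h \in s_I, s_I)\, P(s_I \mid h \in s_I) \\
&= \frac{n_I}{N_I} \sum_{s_I : h \in s_I} \underbrace{\min\left(\frac{n_{\mathrm{target}}}{N_{s_I}}, 1\right)}_{(*)} \binom{N_I - 1}{n_I - 1}^{-1}.
\end{aligned}
$$

For SRS-stratified, we replace (*) with $\min\left(\frac{\mathrm{round}(N_h\, n_{\mathrm{target}} / N_{s_I})}{N_h}, 1\right)$. In the simulations, instead of
computing this precisely, we estimate it by randomly sampling $s_I$ such that $h \in s_I$. This avoids the
computationally intensive loop over all such sets. Although these weights are not equal
for all individuals, because the distributions of household sizes (from the MVP demographic data)
have no extreme outliers, in our simulations the weights are nearly equal.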
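The Monte Carlo shortcut described above can be sketched as follows for the SRS-systematic case. This is a minimal illustration with hypothetical names, assuming the within-household inclusion probability given a first-stage sample is the target sample size divided by the number of eligible people in the sampled households (capped at one).

```python
import random


def estimate_inclusion_probability(h, eligible_counts, n_households, n_target, n_draws, rng):
    """Estimate the inclusion probability of a person in household h.

    eligible_counts: household id -> eligible people, for ALL households in the
        village (household h must have at least one eligible person).
    Instead of summing over every first-stage sample containing h, average the
    conditional second-stage probability over random samples that contain h.
    """
    ids = list(eligible_counts)
    others = [g for g in ids if g != h]
    acc = 0.0
    for _ in range(n_draws):
        sampled = [h] + rng.sample(others, n_households - 1)  # SRS conditional on h
        eligible_in_sample = sum(eligible_counts[g] for g in sampled)
        acc += min(n_target / eligible_in_sample, 1.0)
    first_stage = n_households / len(ids)                     # P(h is sampled)
    return first_stage * acc / n_draws
```

The design weight for a person is the reciprocal of this estimate; when household counts vary little, the estimates (and hence the weights) are close to equal across households.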















For the PPS scheme, the inclusion probabilities are:

$$
\begin{aligned}
\pi_{hi} &= E_k\left[P(\text{person } i \text{ in household } h \text{ is sampled} \mid \text{household } h \text{ is chosen } k \text{ times})\right] \\
&= E_k\left[1 - (1 - n_h/N_h)^k\right] \quad \text{since we independently subsample a household as many times as it is drawn.}
\end{aligned}
$$

Since $k \sim \mathrm{Bin}(m_I, p_h)$, by its probability generating function, we obtain

$$
\pi_{hi} = 1 - \left(p_h (1 - n_h/N_h) + (1 - p_h)\right)^{m_I} = 1 - \left(1 - p_h \frac{n_h}{N_h}\right)^{m_I}.
$$

If $p_h \frac{n_h}{N_h}$ is small, we can approximate this as:

$$
\pi_{hi} \approx m_I\, p_h\, \frac{n_h}{N_h}.
$$

In PPS sampling, $p_h \propto x_h$, where $x_h$ is a measure of household size (Särndal et al., 1992, p.97).
So the PPS weights are:

$$
w_{hi} = \frac{\sum_{h' \in U} x_{h'}}{m_I\, x_h} \cdot \frac{N_h}{n_h}.
$$

If $x_h \propto N_h$, and $n_h \propto 1$, then the design is self-weighted. We take $n_h = c$, a constant, but we
cannot choose $x_h$ such that $x_h \propto N_h$ for all modules, since the target age-sex groups differ from
module to module. We chose $x_h = N_{h,\mathrm{total}}$, the number of household members under 50 years of
age, because it represented a compromise between the different target age-sex groups. Thus, our
weights are $w_{hi} \propto \frac{N_h}{N_{h,\mathrm{total}}}$.
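The PPS inclusion probability, its small-probability approximation, and the resulting weight are simple enough to check numerically. The sketch below is illustrative only; the function names and the numbers in the example are hypothetical.

```python
def pps_inclusion_probability(p_h, n_h, N_h, m_draws):
    """Exact inclusion probability: 1 - (1 - p_h * n_h / N_h) ** m_draws."""
    return 1.0 - (1.0 - p_h * n_h / N_h) ** m_draws


def pps_inclusion_approx(p_h, n_h, N_h, m_draws):
    """First-order approximation, valid when p_h * n_h / N_h is small."""
    return m_draws * p_h * n_h / N_h


def pps_weight(x_h, total_x, N_h, n_h, m_draws):
    """Design weight implied by the approximation, with p_h = x_h / total_x."""
    return (total_x / (m_draws * x_h)) * (N_h / n_h)


if __name__ == "__main__":
    # Hypothetical household: selection probability 0.001 per draw, 1 of 2 people sampled.
    print(pps_inclusion_probability(0.001, 1, 2, 300))
    print(pps_inclusion_approx(0.001, 1, 2, 300))
    # When the size measure equals the target-group count, weights are equal.
    print(pps_weight(4, 100, 4, 1, 10), pps_weight(2, 100, 2, 1, 10))
```

The last line illustrates the self-weighting property: if the size measure is proportional to the number of people in the target group and one person is taken per draw, every person receives the same weight.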


6 Comparisons between sampling schemes: variances and 
design effects 

We want to compare the PPS and SRS designs (in either the Bayesian model-based or the design- 
based paradigms). In general, the two schemes will have slightly different sample sizes, making 
direct comparisons of variances less relevant. For the household survey module, we fix the sample
sizes to be equal, and for the adult, anthropometry, and blood modules, we adjust for the differing 
sample sizes by computing a design effect, defined below. 

The household survey module is administered to the heads of households only, not individual 
members. Therefore, the time cost of the household module is mostly determined by the number 
of households surveyed. We set up our simulations such that the number of household heads to be 
interviewed (i.e. sample size) is the same for the SRS and PPS sampling schemes. We first perform
a PPS sampling of households. Then, we use the number of unique sampled households to obtain
the number of households to sample for the SRS scheme. We then directly compare the variances
in estimating $\bar{Y}$, the finite population mean consumption per person.
For the remaining modules, we compute design effects. To define the design effect (often
abbreviated as “deff”), we first introduce the following notation. Let $\hat{\theta}_\pi = \hat{\theta}_\pi(I, y_{\mathrm{obs}})$ be the
estimator of $\theta$ (in our case, $\theta = \bar{Y}$) where $\pi$ is the sampling distribution assumed to have been used
in drawing sample $s$. Let $V_{\pi_1}(\hat{\theta}_{\pi_2}; y)$ be the sampling variance of an estimator of $\theta$ that assumes
sampling distribution $\pi_2$, and $\pi_1$ is the distribution with respect to which we want the variance.
Let $\hat{V}_{\pi_1}(\hat{\theta}_{\pi_2}; \pi_3; I, y_{\mathrm{obs}})$ be an estimator of this variance where $\pi_3$ is the sampling distribution assumed to have been
used in drawing sample $s$. The population design effect is defined as $\mathrm{deff}_p = V_p(\hat{\theta}_p; y) / V_{\mathrm{SRS}}(\hat{\theta}_{\mathrm{SRS}}; y)$.
The estimated design effect is defined as $\widehat{\mathrm{deff}}_p = \hat{V}_p(\hat{\theta}_p; p; y_{\mathrm{obs}}) / \hat{V}_{\mathrm{SRS}}(\hat{\theta}_{\mathrm{SRS}}; p; y_{\mathrm{obs}})$.

In the design-based setting, we compute design effects assuming sampling with-replacement in
both numerator and denominator variances. This is done in the survey package by specifying
deff = "replace".

For the model-based simulations, we estimate the numerator of the deff with the posterior
variance for $\bar{Y}$ from fitting a model that includes enough design variables such that the data
collection mechanism is ignorable with respect to this model. This posterior variance includes an
implicit finite population correction, so we compute a denominator variance that also includes such
a correction:

$$
V_{\mathrm{SRS}}(\hat{\theta}_{\mathrm{SRS}}; y) = V_{\mathrm{SRS}}(\bar{y}; y) = \left(1 - \frac{n}{N}\right) \frac{S^2}{n},
\tag{5}
$$

where $S^2 = \frac{1}{N-1} \sum_{i=1}^{N} (y_i - \bar{Y})^2$.
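For a simulated population, where every outcome is known, an SRS variance with a finite population correction can be computed directly. The sketch below uses the standard textbook formula for the variance of a sample mean under simple random sampling without replacement; the function names are hypothetical and the example population is made up.

```python
from statistics import variance


def srs_variance_with_fpc(population_values, n):
    """Variance of the sample mean under SRS without replacement.

    Uses (1 - n/N) * S^2 / n, where S^2 is the population variance with
    divisor N - 1 and N is the population size.
    """
    N = len(population_values)
    if not 1 <= n <= N:
        raise ValueError("sample size must be between 1 and the population size")
    S2 = variance(population_values)  # statistics.variance divides by N - 1
    return (1.0 - n / N) * S2 / n


def design_effect(variance_under_design, variance_under_srs):
    """Design effect: variance under the design relative to SRS of the same size."""
    return variance_under_design / variance_under_srs


if __name__ == "__main__":
    population = [1, 2, 3, 4, 5]
    print(srs_variance_with_fpc(population, 2))
```

A census (n equal to N) gives zero variance, which is the sense in which the correction is needed to match a posterior variance for a finite population quantity.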


To assess our estimated deff in the model-based setting, we compare the posterior variance
$V_{p\text{-ignorable}}(\theta \mid y_{\mathrm{obs}})$ from fitting an ignorable model with respect to a sampling distribution $p$ to
the design-based sampling variance of the posterior means, $E_{p\text{-ignorable}}(\theta \mid y_{\mathrm{obs}})$. The latter can be
computed by simulation: we sample repeatedly from the full population using distribution $p$, fit
the $p$-ignorable model, obtain a posterior mean of $\theta$, and compute the variance of these across
the samples from $p$. Fixing one finite population, in Figure 1(a) we create a histogram of posterior
variances from fitting the $p$-ignorable model to each sample, and indicate with a vertical line the
design-based variance of the posterior means, which is computed by simulation. We make the same
comparison for $p$ = a simple random sample (and its ignorable model with flat priors and no design
variables), and include the closed-form design-based estimate (5) as a vertical line, in addition to
the simulation-computed design-based estimate. See Figure 1(b). We see that the posterior variances
appear unbiased for the design-based variances.


7 Simulation results 

Our results are displayed in the appendix, where we see that neither the SRS nor PPS sampling of
households is more efficient (i.e. has a lower design effect) in general.
[Figure 1, two panels, each a histogram of the posterior variances of $\bar{Y}$ under repeated sampling.
(a) Sampling distribution $p$ is SRS sampling of households followed by an equal-probability systematic
sample within households. (b) Sampling distribution $p$ is SRS sampling of people.]

Figure 1: Fixing one finite population, we show a histogram of posterior variances from fitting a p-ignorable
model to each sample using sampling distribution p, and indicate in a vertical line the design-based variance
of the posterior means, which is computed by simulation. When p is simple random sample of people, we
also include the closed-form design-based estimate (5).

We see that for modules with higher target sample sizes, SRS tends to be less efficient. For
example, in the under-5 blood ($n_{\mathrm{target}} = 300$) and adult ($n_{\mathrm{target}} = 400$) modules the SRS scheme is
less efficient. One explanation for this observation is the different numbers of people sampled per
household in the SRS versus PPS schemes, which has efficiency implications due to the intra-household
correlation. In the PPS scheme, the households sampled in the first stage are larger and therefore
more likely to include people in the target demographics. In contrast, in the SRS scheme, the
sample is often drawn from fewer households, with more people sampled per household. Moreover,
the PPS scheme only samples one person per household draw (though this can result in more than
one person being sampled per household due to the with-replacement sampling at the first stage).

For modules where the target sample size is low, there are fewer people sampled per household in 
the SRS sampling scheme, and the intra-house correlation does not substantially impact the design 
efficiency. Therefore, because SRS has near-equal individual-level probability of sampling (see the 
design-weights computed above), its design effect in the absence of household clustering should 
be close to one. In contrast, the PPS scheme does not have near-equal individual-level sampling 
probabilities because the measure of household size is not proportional to the target demographic 
(see the design-weights computed above). 

The relative efficiency of SRS versus PPS is similar between design-based and model-based 
simulations. In the few cases where they differ, design-based results show that SRS has higher 
design effects than PPS, relative to model-based results. In general, our model-based simulations 
show more variability across simulations than the design-based simulations. Comparing systematic 
to stratified sampling at the second stage of the SRS schemes, we see few differences except that 
stratified sampling tends to have higher variance across simulations.
8 Final sampling plan


As described above, the PPS scheme requires a sampling frame that includes household sizes, 
whereas the SRS scheme only requires a list of households. Given our results, we cannot justify 
the additional resources required to collect the more detailed household list for the PPS scheme. 
Therefore, our sampling scheme will begin with an SRS sample of households. For the second stage 
of sampling for the adult, anthropometry, and blood modules, we prefer the control over sample 
size achieved by systematic sampling (as opposed to stratified sampling).

The household and nutrition modules follow a different sampling scheme. As mentioned above, 
the household module is administered to all household heads (or other knowledgeable household 
members) within the sampled households. The nutrition module consists of a food frequency 
questionnaire, which takes longer to administer than other modules. We suspect that the within- 
household correlation is very high for data on food frequency, because household members are likely 
to eat similar foods. (This intra-house correlation cannot be measured from project data, because 
the project has always limited this module to one member per household.) For these reasons, we 
limit the nutrition module to one adult (age 15 to 49 years) per household. 


9 Software 


For fitting multilevel models we use Stan in R (Stan Development Team, 2013; R Development
Core Team, 2014).
A Properties of systematic sampling


Definition 1 (The fractional interval method of systematic sampling). Consider a population
of size $N^*$ consisting of people grouped into $n_I$ households indexed by $h$, with $N_h$ people within
household $h$. Let $n_{\mathrm{target}} < N^*$ be the desired sample size. Set $a = \frac{N^*}{n_{\mathrm{target}}}$. Order the households
randomly, and randomly order the people in the target group within the households. Let $k = 1, \ldots, N^*$
label the people in this order:

$$
\underbrace{1, \ldots, N_1}_{\text{household } 1},\; \underbrace{(N_1 + 1), \ldots, (N_1 + N_2)}_{\text{household } 2},\; \ldots,\; \underbrace{\ldots, N^*}_{\text{household } n_I}.
$$

Draw a random real number $\xi$ uniformly between 0 and $a$, $\xi \sim U(0, a)$, and sample all people with
$k$ such that

$$
k - 1 < \xi + (j - 1)a \le k \quad \text{for } j = 1, \ldots, n_{\mathrm{target}}.
$$

(Särndal et al., 1992, p.77)
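The definition translates directly into code. The following is a minimal sketch of the fractional interval method using only the Python standard library, not our production code; the function name is hypothetical, and the clamp on the selected position is an added guard against floating-point round-off rather than part of the definition.

```python
import math
import random


def fractional_interval_sample(household_counts, n_target, rng):
    """Equal-probability systematic sample of n_target people.

    household_counts: household id -> number of eligible people.
    Returns a list of (household id, person index) pairs.
    """
    households = list(household_counts)
    rng.shuffle(households)                      # random household order
    frame = []
    for h in households:
        members = list(range(household_counts[h]))
        rng.shuffle(members)                     # random order within household
        frame.extend((h, i) for i in members)    # household members listed consecutively
    n_frame = len(frame)
    if n_target >= n_frame:
        return frame                             # survey everyone
    a = n_frame / n_target                       # fractional sampling interval
    xi = rng.uniform(0.0, a)
    while xi == 0.0:                             # the start must lie strictly above 0
        xi = rng.uniform(0.0, a)
    sample = []
    for j in range(n_target):
        k = math.ceil(xi + j * a)                # k - 1 < xi + j*a <= k
        k = min(max(k, 1), n_frame)              # guard against round-off at the ends
        sample.append(frame[k - 1])
    return sample


if __name__ == "__main__":
    rng = random.Random(7)
    counts = {h: 1 + h % 4 for h in range(40)}   # toy data, 100 eligible people
    chosen = fractional_interval_sample(counts, 30, rng)
    print(len(chosen))
```

Because household members occupy consecutive positions in the frame and the interval is larger than one, the selected people are spread across households rather than concentrated in a few.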


Claim 1. When performing the sampling scheme in Definition 1, the sample size will be $n_{\mathrm{target}}$.
Proof of Claim 1. Since $a = \frac{N^*}{n_{\mathrm{target}}}$ and $N^* > n_{\mathrm{target}}$, $a > 1$. Since $k - 1 < x \le k \iff \lceil x \rceil = k$, we
can write $\lceil \xi + (j - 1)a \rceil = k$. The ceiling function is monotone increasing and $\lceil x + 1 \rceil = \lceil x \rceil + 1$, so
each time $j$ increases by 1, we get a different value of $k$. Now we must show that the $k$’s stay in the
set $\{1, \ldots, N^*\}$, i.e. those from which we are sampling. The first $k$ is such that $k - 1 < \xi \le k$, where
$\xi \in (0, a)$. Since $\xi > 0$, we must have $k \ge 1$. The last $k$ is such that $k - 1 < \xi + (n_{\mathrm{target}} - 1)a \le k$,
and we know $\xi + (n_{\mathrm{target}} - 1)a = \xi + N^* - a < N^*$ because $\xi < a$. Then $k \le N^*$. Thus, since
each $j$ maps to a unique $k$, we’ve proven we get a sample size of exactly $n_{\mathrm{target}}$. □


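Claim 2. When performing the sampling scheme in Definition 1, the sample size within each
household $h$ is always the ceiling or the floor of the expected sample size in household $h$ under
simple random sampling, $N_h / a$.

We first prove the following lemma which is used to prove the above claim:

Lemma 1. Consider the set $A(x) = \{j \in \mathbb{Z}^+ \mid \xi + (j - 1)a \le x\}$. The maximum of $A(x)$ is $\frac{x}{a}$ if
$\frac{x}{a} \in \mathbb{Z}$, $\lfloor \frac{x}{a} \rfloor$ if $\xi > da$, or $\lceil \frac{x}{a} \rceil$ if $\xi \le da$, where $d = \frac{x}{a} - \lfloor \frac{x}{a} \rfloor$, the “decimal part.”

Proof of Lemma 1. If $\frac{x}{a} \in \mathbb{Z}$, let $j = \frac{x}{a}$ and we see that $\xi + (\frac{x}{a} - 1)a = \xi + x - a \le x$ because
$\xi < a$, so $\frac{x}{a} \in A(x)$. Increasing $j$ by 1 increases the lefthand side of the inequality by $a$, and since
$\xi > 0$, we see this $\frac{x}{a} + 1 \notin A(x)$. Therefore, $\frac{x}{a}$ is the maximum.

If $\frac{x}{a} \notin \mathbb{Z}$, let $d = \frac{x}{a} - \lfloor \frac{x}{a} \rfloor$, the “decimal part.” We see that

$$
\xi + \left(\left\lfloor \tfrac{x}{a} \right\rfloor - 1\right) a \;<\; \xi + \left(\tfrac{x}{a} - 1\right) a \;=\; \xi + x - a \;\le\; x,
$$

where the first gap equals $da$ and the last gap equals $a - \xi$, so $\lfloor \frac{x}{a} \rfloor \in A(x)$. Increasing $j$ by 1
(to $\lceil \frac{x}{a} \rceil$) increases the leftmost side of the inequality by $a$. If $a \le da + (a - \xi)$, i.e. if $\xi \le da$,
then $\lceil \frac{x}{a} \rceil \in A(x)$ and is the maximum; otherwise the maximum is $\lfloor \frac{x}{a} \rfloor$. □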
Proof of Claim 2. Consider household $h$, of size $N_h$. Let $k^*$ be the last person before household
$h$ in the ordering used by the systematic sampling. Then $k^* + 1, \ldots, k^* + N_h$ are the indices for
all members of household $h$. In order to get the number sampled in household $h$, we consider the
maximum of set $A(k^* + N_h)$ (the number of people sampled up through household $h$) and subtract
from it the maximum of set $A(k^*)$ (the number of people sampled before household $h$). This gives,
by Lemma 1, the floor or ceiling of $\frac{k^* + N_h}{a}$ minus the floor or ceiling of $\frac{k^*}{a}$, which equals
$\lfloor \frac{N_h}{a} \rfloor$ or $\lceil \frac{N_h}{a} \rceil$. □

Claim 3. The sampling scheme in definition\^is self-weighted. 
Proof. 

\[
\begin{aligned}
P(\text{person } k \text{ is sampled}) &= P\left(\exists j \in \{1, \ldots, n_{\text{target}}\} \text{ s.t. } k - 1 < \xi + (j-1)a \le k\right) \\
&= P\left(\xi \in \bigcup_{j=1}^{n_{\text{target}}} \{k - 1 - (j-1)a < \xi \le k - (j-1)a\}\right) \\
&= \frac{1}{a}.
\end{aligned}
\]

The last equality holds because each interval has length $1 = [k - (j-1)a] - [k - 1 - (j-1)a]$, and the 
space between intervals $j$ and $j+1$ is $a - 1 = [k - 1 - (j-1)a] - [k - ja]$. The intervals therefore 
repeat with period $a$, so $(0, a)$ overlaps the intervals in pieces of length totaling 1. Since 
$\xi \sim U(0, a)$, the probability is $\frac{1}{a}$. 

The interval $(0, a)$ can overlap at most 2 intervals of length 1, with overlaps totaling a length of 1. 

[Figure: number line showing the intervals for j = 1, 2, 3, 4, each of length 1, as over-braces; under-braces mark spans of length a; the interval (0, a) is drawn in orange.] 

□ 
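
The overlap argument can also be verified exactly. The following sketch is illustrative only: it assumes $a = N / n_{\text{target}}$ with $a \ge 1$, and the function name `inclusion_probability` and the values of `N` and `n_target` are hypothetical. For each person $k$ it sums the overlap of $(0, a)$ with the intervals $(k - 1 - (j-1)a,\ k - (j-1)a]$ and divides by $a$, the length of the support of $\xi$.

```python
from fractions import Fraction

def inclusion_probability(k, N, n_target):
    """P(person k is sampled) when xi ~ U(0, a) and a = N / n_target (assumed)."""
    a = Fraction(N, n_target)
    total = Fraction(0)
    for j in range(1, n_target + 1):
        lo = k - 1 - (j - 1) * a  # xi must exceed this for point j to land on person k
        hi = k - (j - 1) * a      # and must not exceed this
        overlap = min(hi, a) - max(lo, 0)
        if overlap > 0:
            total += overlap
    return total / a

N, n_target = 137, 40  # hypothetical population and target sample size
probs = [inclusion_probability(k, N, n_target) for k in range(1, N + 1)]
print(set(probs) == {Fraction(n_target, N)})
```

Every person has the same inclusion probability $1/a$, which is what self-weighting requires.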
Figure 2: Design-based adult module results 

[Figure 3 panels: design effects (deff), sample sizes, and effective sample sizes (n_eff) across simulated populations and samples for the blood module, for the under-5 and school-age groups, 1 per PPS draw, systematic; shown by country (Uganda, Nigeria, Senegal, Ghana, Tanzania, Kenya, Malawi, Mali, Rwanda) for SRS and PPS.] 

Figure 3: Design-based blood (malaria and anemia) module results: under 5 and school-age children 
Figure 4: Design-based blood (malaria and anemia) module results: men and women 
Figure 5: Design-based anthro module results 
Figure 6: Design-based adult module results 
Figure 7: Design-based blood (malaria and anemia) module results: under 5 and school-age children 
Figure 8: Design-based blood (malaria and anemia) module results: men and women 
Figure 9: Design-based anthro module results 
Figure 10: Design-based consumption module results 

B.2 Model-based Results 

Figure 11: Model-based adult module results 
Figure 12: Model-based blood (malaria and anemia) module results: under 5 and school-age children 
Figure 13: Model-based blood (malaria and anemia) module results: men and women 
Figure 14: Model-based anthropometry module results 
Figure 15: Model-based adult module results 
Figure 16: Model-based blood (malaria and anemia) module results: under 5 and school-age children 
Figure 17: Model-based blood (malaria and anemia) module results: men and women 
Figure 18: Model-based anthropometry module results 
Figure 19: Model-based consumption module results 